Abstract

The purpose of mid-level visual element discovery is to find
clusters of image patches that are representative of, and which
discriminate between, the contents of the relevant images. Here we
propose a pattern-mining approach to the problem of identifying
mid-level elements within images, motivated by the observation that
such techniques have been very effective, and efficient, in
achieving similar goals when applied to other data types. We show
that CNN activations extracted from image patches typically possess
two appealing properties that enable seamless integration with
pattern mining techniques. The marriage between CNN activations and
a pattern mining technique leads to fast and effective discovery of
representative and discriminative patterns from a huge number of
image patches, from which mid-level elements are retrieved. Given
the patterns and retrieved mid-level visual elements, we propose two
methods to generate image feature representations. The first
encoding method uses the patterns as codewords in a dictionary in a
manner similar to the Bag-of-Visual-Words model. We thus label this
a Bag-of-Patterns representation. The second relies on mid-level
visual elements to construct a Bag-of-Elements representation. We
evaluate the two encoding methods on object and scene classification
tasks, and demonstrate that our approach outperforms or matches the
state of the art on these tasks.

keywords: Mid-level visual element discovery, pattern 
mining, convolutional neural networks 
1. Introduction

Image patches that capture important aspects of objects are crucial
to a variety of state-of-the-art object recognition systems. For
instance, in the Deformable Parts Model (DPM) [31] such image
patches represent object parts that are treated as latent variables
in the training process. In Poselets [12], such image patches are
used to represent human body parts, which have been shown to be
beneficial for human detection [10] and human attribute prediction
[11] tasks. Yet, obtaining these informative image patches in both
DPM and Poselets requires extensive human annotation (DPM needs
object bounding boxes, while the Poselets model needs human body
keypoints). Clearly, the discovery of these representative image
patches with minimal human supervision would be desirable. Studies
on mid-level visual elements (a.k.a. mid-level discriminative
patches) offer one possible solution to this problem.

Mid-level visual elements are clusters of image patches discovered
from a dataset where only image labels are available. As noted in
the pioneering work of [80], such patch clusters are suitable for
interpretation as mid-level visual elements only if they satisfy two
requirements, i.e., representativeness and discriminativeness.
Representativeness requires that mid-level visual elements should
frequently occur in the images with the same label (i.e., target
category), while discriminativeness implies that they should be
seldom found in images not containing the object of interest. For
instance, image patches containing the wheel of a car may be a
mid-level visual element for the car category, as most car images
contain wheels, and car wheels are seldom found in images of other
objects (this implies also that they are visually distinct from
other types of wheels). The discovery of mid-level visual elements
has boosted performance in a variety of vision tasks, such as image
classification [24,50,80] and action recognition [47,91].

As another line of research, pattern mining techniques have also
enjoyed popularity amongst the computer vision community, with
applications including image classification [32,33,87,95], image
retrieval [34] and action recognition [38,39], largely due to their
capability of discovering informative patterns hidden inside massive
amounts of data.

In this paper, we address mid-level visual element discovery from a
pattern mining perspective. The novelty of our approach is that it
systematically brings together Convolutional Neural Network (CNN)
activations and association rule mining, a well-known pattern mining
technique. Specifically, we observe that for an image patch,
activations extracted from fully-connected layers of a CNN possess
two appealing properties which enable their seamless integration
with this pattern mining technique. Based on this observation, we
formulate mid-level visual element discovery from the perspective of
pattern mining and propose a Mid-level Deep Pattern Mining (MDPM)
algorithm that effectively and efficiently discovers representative
and discriminative patterns from a huge number of image patches.
When we retrieve and visualize image patches with the same pattern,
it turns out that they are not only visually similar, but also
semantically consistent (see by way of example the game in Fig. 1
and then check your answers in the answer key below).

Figure 1. Name that Object: Given the mid-level visual elements
discovered by our algorithm from the Pascal VOC 2007 dataset, can
you guess which categories they are from?

Relying on the discovered patterns and retrieved mid-level visual
elements, we propose two methods to generate image features, one for
each of them (Sec. 5). For the first feature encoding method, we
compute a Bag-of-Patterns representation which is motivated by the
well-known Bag-of-Visual-Words representation [81]. For the second
method, we first merge mid-level visual elements and train detectors
simultaneously, followed by the construction of a Bag-of-Elements
representation. We evaluate the proposed feature representations on
generic object and scene classification tasks. Our experiments
demonstrate that the classification performance of the proposed
feature representations not only outperforms all current methods in
mid-level visual element discovery by a noticeable margin with far
fewer elements used, but also outperforms or matches the performance
of state-of-the-art methods using CNNs for the same task.

In summary, the merits of the proposed approach can be understood
from different perspectives.

• Efficient handling of massive image patches. As noted by [24], one
of the challenges in mid-level visual element discovery is the
massive number of randomly sampled patches to go through. However,
pattern mining techniques are designed to handle large data sets,
and are extremely capable of doing so. In this sense, if
appropriately employed, pattern mining techniques can be a powerful
tool for overcoming this data deluge in mid-level visual element
discovery.

• A straightforward interpretation of representativeness and
discriminativeness. In previous works on mid-level visual element
discovery, different methods have been proposed for interpreting the
dual requirements of representativeness and discriminativeness. In
this work, interpreting these two requirements in pattern mining
terminology is straightforward. To our knowledge, we are the first
to formulate mid-level visual element discovery from the perspective
of pattern mining.

• Feature encoder of CNN activations of image patches. Recent
state-of-the-art results on many image classification tasks (e.g.,
indoor scene, object, texture) are achieved by applying classical
feature encoding methods [48, 69] on top of CNN activations of image
patches [17,18,42]. In our work, we demonstrate that mid-level
visual elements, which are discovered by the proposed MDPM
algorithm, can also be a good alternative feature encoder for CNN
activations of image patches.

Answer key (Fig. 1): 1. aeroplane, 2. train, 3. cow, 4. motorbike,
5. bike, 6. sofa.


The remainder of the paper is organized as follows. In Sec. 2, we
review some of the related work on mid-level visual element
discovery as well as relevant vision applications. In Sec. 3 we
explain some of the relevant pattern mining terminology and how
pattern mining techniques have been successfully applied to computer
vision tasks previously. The details of our MDPM algorithm are
provided in Sec. 4. In particular, we start by introducing two
desirable properties of CNN activations extracted from image patches
(Sec. 4.1), which serve as the cornerstones of the proposed MDPM
algorithm. In Sec. 5, we apply the discovered patterns and mid-level
visual elements to generate image feature representations, followed
by extensive experimental validations in Sec. 6. Some further
discussions are presented in Sec. 7 and we conclude the paper in
Sec. 8.

Preliminary results of this work appeared in [55]. In this paper, we
extend [55] in the following aspects. Firstly, on the theory side,
we propose a new method to generate image representations using the
discovered patterns (i.e., the Bag-of-Patterns representation).
Furthermore, more extensive experiments are presented in this
manuscript, such as a more detailed analysis of different components
of the proposed framework. Last but not least, we present a new
application of mid-level visual elements, which is the analysis of
the role of context information using mid-level visual elements
(Sec. 6.4). At the time of preparing this manuscript, we are aware
of at least two works [22, 65] which build on our previous work [55]
in different vision applications, including human action and
attribute recognition [22] and modeling visual compatibility [65],
which suggests that our work is valuable to the computer vision
community. Our code is available at
https://github.com/yaoliUoA/MDPM.


2. Related work 

2.1. Mid-level visual elements 

Mid-level visual features have been widely used in computer vision.
They can be constructed by different methods, such as supervised
dictionary learning [13], hierarchical encoding of low-level
descriptors [1, 33, 78] and the family of mid-level visual elements
[24,50,80]. As the discovery of mid-level visual elements is the
very topic of this paper, we mainly discuss previous works on this
topic.

Mid-level visual element discovery has been shown to be beneficial
to image classification tasks, including scene categorization
[9,24,50,54,55,62,67,80,83,92] and fine-grained categorization [93].
For this task, there are three key steps: (1) discovering candidates
of mid-level visual elements, (2) selecting a subset of the
candidates, and finally (3) generating image feature
representations.

In the first step, various methods have been proposed in previous
works to discover candidates of mid-level visual elements. Usually
starting from randomly sampled patches which are weakly labeled
(e.g., image-level labels are known), candidates are discovered from
the target category by different methods, such as cross-validation
training of patch detectors [80], training Exemplar LDA detectors
[50], discriminative mode seeking [24], minimizing a latent SVM
objective function with a group sparsity regularizer [83,84], and
the use of Random Forests [9]. In this work, we propose a new
algorithm for discovering the candidates from a pattern mining
perspective (Sec. 4).

The goal of the second step is to select, from a large pool of
candidates, the mid-level visual elements which best satisfy the
requirements of representativeness and discriminativeness. Some
notable criteria in previous works include a combination of purity
and discriminativeness scores [80], entropy ranking [50,53], the
Purity-Coverage plot [24] and the squared whitened norm response
[4, 5]. In our work, we select mid-level visual elements from the
perspective of pattern selection (Sec. 5.1.1) and merging
(Sec. 5.2.1).

As for the final step of generating image feature representations
for classification, most previous works [24,50,80] follow the same
principle, that is, the combination of maximum detection scores of
all mid-level elements from different categories in a spatial
pyramid [52]. This encoding method is also adopted in our work
(Sec. 5.2.2).

In addition to image classification, some works apply mid-level
visual elements to other vision tasks as well, including visual data
mining [25,74], action recognition [47,91], discovering stylistic
elements [53], scene understanding [35,36,66], person
re-identification [99], image re-ranking [20], and weakly-supervised
object detection [82]. In object detection, before the popularity of
R-CNN [41], approaches based on learning a collection of mid-level
detectors are illustrated by [7,27,76].
2.2. Pattern mining in computer vision

Pattern mining techniques, such as frequent itemset mining and its
variants, have been studied primarily amongst the data mining
community, but a growing number of applications can be found in the
computer vision community.

Early works have used pattern mining techniques in object
recognition tasks, such as finding frequent co-occurrent visual
words [97] and discovering distinctive feature configurations [70].
Later on, for recognizing human-object interactions, [95] introduce
‘grouplets’ discovered by a pattern mining algorithm, which encode
appearance, shape and spatial relations of multiple image patches.
For 3D human action recognition, discriminative actionlets are
discovered in a pattern mining fashion [89]. By finding closed
patterns from local visual word histograms, [32, 33] introduce
Frequent Local Histograms (FLHs) which can be utilized to generate
new image representations for classification. Another interesting
work is [87], in which images are represented by histograms of
pattern sets. Relying on a pattern mining technique, [34] illustrate
how to address the image retrieval problem using mid-level patterns.
More recently, [74] design a method for summarizing image
collections using closed patterns. Pattern mining techniques have
also been successfully applied to some other vision problems, such
as action recognition in videos [38,39].

For the image classification task, most of the aforementioned works
rely on hand-crafted features, especially Bag-of-Visual-Words [81],
for pattern mining. In contrast, to our knowledge, we are the first
to describe how pattern mining techniques can be combined with
state-of-the-art CNN features, which are now widely applied in
computer vision. Besides, our work can be viewed as a new
application of pattern mining techniques in vision, that is, the
discovery of mid-level visual elements.


3. Background on pattern mining

3.1. Terminology

Originally developed for market basket analysis, frequent itemset
and association rule are well-known terms within data mining. Both
might be used in processing large numbers of customer transactions
to reveal information about their shopping behaviour, for example.

More formally, let $A = \{a_1, a_2, \ldots, a_M\}$ denote a set of
$M$ items. A transaction $T$ is a subset of $A$ (i.e.,
$T \subseteq A$) which contains only a subset of items
($|T| \ll M$). We also define a transaction database
$\mathcal{D} = \{T_1, T_2, \ldots, T_N\}$ containing $N$ (typically
millions, or more) transactions. Given these definitions, the
frequent itemset and association rule are defined as follows.

Frequent itemset. A pattern $P$ is also a subset of $A$ (i.e., an
itemset). We are interested in the fraction of transactions
$T \in \mathcal{D}$ which contain $P$. The support of $P$ reflects
this quantity:

$$\mathrm{supp}(P) = \frac{|\{T \mid T \in \mathcal{D},\ P \subseteq T\}|}{N} \in [0, 1], \tag{1}$$

where $|\cdot|$ measures the cardinality. $P$ is called a frequent
itemset when $\mathrm{supp}(P)$ is larger than a predefined
threshold.

Association rule. An association rule $P \to a$ implies a
relationship between pattern $P$ (antecedents) and an item $a$
(consequence). We are interested in how likely it is that $a$ is
present in the transactions which contain $P$ within $\mathcal{D}$.
In a typical application this might be taken to imply that customers
who bought items in $P$ are also likely to buy item $a$, for
instance. The confidence of an association rule
$\mathrm{conf}(P \to a)$ can be taken to reflect this probability:

$$\mathrm{conf}(P \to a) = \frac{\mathrm{supp}(P \cup \{a\})}{\mathrm{supp}(P)} = \frac{|\{T \mid T \in \mathcal{D},\ (P \cup \{a\}) \subseteq T\}|}{|\{T \mid T \in \mathcal{D},\ P \subseteq T\}|} \in [0, 1]. \tag{2}$$

In practice, we are interested in “good” rules, meaning that the
confidence of these rules should be reasonably high.

A running example. Consider the case when there are 4 items in the
set (i.e., $A = \{a_1, a_2, a_3, a_4\}$) and 5 transactions in
$\mathcal{D}$,

• $T_1 = \{a_3, a_4\}$,

• $T_2 = \{a_1, a_2, a_4\}$,

• $T_3 = \{a_1, a_4\}$,

• $T_4 = \{a_1, a_3, a_4\}$,

• $T_5 = \{a_1, a_2, a_3, a_4\}$.

The value of $\mathrm{supp}(\{a_1, a_4\})$ is 0.8 as the itemset
(pattern) $\{a_1, a_4\}$ appears in 4 out of 5 transactions (i.e.,
$\{T_2, T_3, T_4, T_5\}$). The confidence value of the rule
$\{a_1, a_4\} \to a_3$ is 0.5 (i.e.,
$\mathrm{conf}(\{a_1, a_4\} \to a_3) = 0.5$) as 50% of the
transactions containing $\{a_1, a_4\}$ also contain the item $a_3$
(i.e., $\{T_4, T_5\}$).
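
The two quantities in the running example can be checked mechanically. The following Python sketch is an illustration only; the function names `supp` and `conf` and the encoding of item a_i by the integer i are assumptions made for this example, not part of the original method.

```python
from fractions import Fraction

# Transaction database of the running example; item a_i is encoded
# by the integer i (an assumption made for this sketch).
D = [
    {3, 4},        # T1
    {1, 2, 4},     # T2
    {1, 4},        # T3
    {1, 3, 4},     # T4
    {1, 2, 3, 4},  # T5
]


def supp(pattern, transactions):
    """Fraction of transactions containing every item of `pattern` (Eq. 1)."""
    pattern = set(pattern)
    hits = sum(1 for t in transactions if pattern <= t)
    return Fraction(hits, len(transactions))


def conf(pattern, item, transactions):
    """Confidence of the rule pattern -> item (Eq. 2).

    Undefined (division by zero) when the pattern never occurs.
    """
    return supp(set(pattern) | {item}, transactions) / supp(pattern, transactions)


print(float(supp({1, 4}, D)))     # 0.8
print(float(conf({1, 4}, 3, D)))  # 0.5
```

Exact fractions are used so that the results match the hand-computed values without floating-point rounding.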

3.2. Algorithms 

The Apriori algorithm [3] is the most renowned pattern mining
technique for discovering frequent itemsets and association rules
from a huge number of transactions. It employs a breadth-first,
bottom-up strategy to explore itemsets. Starting from single items,
at each iteration the algorithm checks the frequency in the
transactions of candidate itemsets of the same size, retains only
those whose support values exceed a predefined threshold, and then
increases the itemset size by one. The Apriori algorithm relies on
the heuristic that if an itemset does not meet the threshold, none
of its supersets can do so. Thus the search space can be
dramatically reduced. For computer vision applications, the Apriori
algorithm has been used by [70,95] and [39].
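
The level-wise search described above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration of the breadth-first strategy and the superset-pruning heuristic, not the implementation of [3] or the one used in this work; the function name `apriori` and its signature are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support > min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    level = {}
    for i in items:
        s = support(frozenset([i]))
        if s > min_support:
            level[frozenset([i])] = s

    frequent = dict(level)
    k = 2
    while level:
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(list(level), 2)
                      if len(a | b) == k}
        # Prune: an itemset can only be frequent if all of its
        # (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {}
        for c in candidates:
            s = support(c)
            if s > min_support:
                level[c] = s
        frequent.update(level)
        k += 1
    return frequent


# Running example of Sec. 3.1, with item a_i encoded as the integer i.
D = [{3, 4}, {1, 2, 4}, {1, 4}, {1, 3, 4}, {1, 2, 3, 4}]
frequent = apriori(D, min_support=0.5)
```

On the running example with a threshold of 0.5, the frequent itemsets are {a1}, {a3}, {a4}, {a1, a4} and {a3, a4}; the candidate {a1, a3, a4} is never counted because its subset {a1, a3} is infrequent.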

There are also some other well-known pattern mining techniques, such
as the FP-growth [43], LCM [86], DDPMine [15] and KRIMP [88]
algorithms. These pattern mining techniques have also been adopted
in computer vision research [32-34,74,97]. In this work, we opt for
the Apriori algorithm for pattern mining.

3.3. Challenges 

Transaction creation. The process of transforming data into a set of
transactions is the most crucial step in applying such pattern
mining techniques to vision applications. Ideally, the
representation of the data in this format should allow all of the
relevant information to be represented, with no information loss.
However, as noted in [87], there are two strict requirements of
pattern mining techniques that make creating transactions with no
information loss very challenging.

1. Each transaction can only have a small number of items, as the
potential search space grows exponentially with the number of items
in each transaction.

2. What is recorded in a transaction must be a set of integers
(which are typically the indices of items).

As we will show in the next section, thanks to two 
appealing properties of CNN activations (Sec. 4.1), these 
two requirements can be fulfilled effortlessly if one uses 
CNN activations to create transactions. 

Pattern explosion. Known as pattern explosion in the pattern mining
literature, the number of patterns discovered with a pattern mining
technique can be enormous, with some of the patterns being highly
correlated. Therefore, before using patterns for applications, the
first step is pattern selection, that is, to select a subset of
patterns which are both discriminative and not redundant.

For the task of pattern selection, some heuristic rules have been
proposed in previous works. For instance, [97] compute a likelihood
ratio to select patterns. [32, 33] use a combination of
discriminativity scores and representativity scores to select
patterns. [74], instead, propose a pattern interestingness criterion
and a greedy algorithm for selecting patterns. Instead of a two-step
framework which includes pattern mining and selection, some previous
works in pattern mining [15,88] propose to find discriminative
patterns within the pattern mining algorithm itself, thus avoiding
the problem of pattern explosion and removing the need for pattern
selection. In this work, to address the problem of pattern
explosion, we advocate merging patterns describing the same visual
concept rather than selecting a subset of patterns.

4. Mid-level deep pattern mining 

An overview of the proposed MDPM algorithm is illustrated in Fig. 2.
Assuming that image labels are known, we start by sampling a huge
number of random patches both from images of the target category
(e.g., car) and images that do not contain the target category
(i.e., the background class). With the two appealing properties of
CNN activations of image patches (Sec. 4.1), we then create a
transaction database in which each transaction corresponds to a
particular image patch (Sec. 4.2). Patterns are then discovered from
the transaction database using association rule mining (Sec. 4.3),
from which mid-level visual elements can be retrieved efficiently
(Sec. 4.4).

4.1. Properties of CNN activation of patches 

In this section we provide a detailed analysis of the performance of
CNN activations on the MIT Indoor dataset [71], from which we are
able to deduce two important properties thereof. These two
properties are critical to the suitability of such activations to
form the basis of a transaction-based approach.

We first sample 128 × 128 patches with a stride of 32 pixels from
each image. Then, for each image patch, we extract the
4096-dimensional non-negative output of the first fully-connected
layer of BVLC Reference CaffeNet [49]. To generate image features,
we consider the following three strategies. The first strategy is
our baseline, which is simply the outcome of max pooling on CNN
activations of all patches in an image. The next two strategies are
variants of the baseline which are detailed as follows.

1. CNN-Sparsified. For each 4096-dimensional CNN activation of an
image patch, we retain the magnitudes of only the K largest elements
in the vector, setting the remaining elements to zero. The feature
representation for an image is the outcome of applying max pooling
to the thus revised CNN activations.

2. CNN-Binarized. For each 4096-dimensional CNN activation of an
image patch, we set the K largest elements in the vector to one and
the remaining elements to zero. The feature representation for an
image is the outcome of performing max pooling on these binarized
CNN activations.
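
The two variants differ only in what is written at the positions of the K largest activations. The following pure-Python sketch illustrates this on short toy vectors; the helper names (`top_k_indices`, `sparsify`, `binarize`, `max_pool`) and the tie-breaking rule are assumptions for illustration, and real activations would be 4096-dimensional.

```python
def top_k_indices(act, k):
    """Positions of the k largest entries (ties broken by lower index)."""
    order = sorted(range(len(act)), key=lambda i: (-act[i], i))
    return set(order[:k])


def sparsify(act, k):
    """CNN-Sparsified: keep the k largest magnitudes, zero the rest."""
    keep = top_k_indices(act, k)
    return [v if i in keep else 0.0 for i, v in enumerate(act)]


def binarize(act, k):
    """CNN-Binarized: set the k largest entries to one, the rest to zero."""
    keep = top_k_indices(act, k)
    return [1.0 if i in keep else 0.0 for i in range(len(act))]


def max_pool(vectors):
    """Element-wise maximum over the activations of all patches in an image."""
    return [max(column) for column in zip(*vectors)]


# Two toy "patch activations" from one image, with K = 2.
patches = [[0.1, 0.9, 0.0, 0.5, 0.3],
           [0.7, 0.0, 0.2, 0.0, 0.6]]
sparse_feature = max_pool([sparsify(p, 2) for p in patches])
binary_feature = max_pool([binarize(p, 2) for p in patches])
```

Here `sparse_feature` is `[0.7, 0.9, 0.0, 0.5, 0.6]` and `binary_feature` is `[1.0, 1.0, 0.0, 1.0, 1.0]`: both record the same active dimensions, and only the binarized version discards the magnitudes.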

Figure 2. An illustration of the mid-level deep pattern mining
process. Given image patches sampled from both the target category
(e.g., car) and the background class, we represent each as a
transaction after extracting its CNN activation. Patterns are then
discovered by the well-known association rule mining. Mid-level
visual elements are retrieved from image patches with the same
patterns.

For each strategy we train a multi-class linear SVM classifier in a
one-vs-all fashion. The classification accuracy achieved by each of
the two above strategies for a range of K values is summarized in
Table 1. In comparison, our baseline method gives an accuracy of
65.15%.

K                  10       20       50       100
CNN-Sparsified     50.10    56.33    60.34    61.68
CNN-Binarized      54.34    59.15    61.35    61.29

Table 1. Classification accuracies (%) achieved by the two
strategies for keeping the K largest magnitudes of CNN activations
of image patches on the MIT Indoor dataset. Note that our baseline,
the outcome of max pooling on CNN activations of all patches in an
image, gives an accuracy of 65.15%.

Analyzing the results in Table 1 leads to two observations about CNN
activations of fully-connected layers (except the last
classification layer):

1. Sparse. Comparing the performance of “CNN-Sparsified” with that
of the baseline feature (65.15%), it is clear that accuracy is
reasonably high when using sparsified CNN activations with a small
fraction of non-zero magnitudes out of 4096.

2. Binary. Comparing “CNN-Binarized” with the “CNN-Sparsified”
counterpart, it can be seen that CNN activations do not suffer from
binarization when K is small. Accuracy even increases slightly in
some cases.

Note that the above properties are also observed in recent works on
analyzing CNNs [2,26].

Conclusion. The above two properties imply that for an 
image patch, the discriminative information within its CNN 
activation is mostly embedded in the dimension indices of 
the K largest magnitudes. 


4.2. Transaction creation 

Transactions must be created before any pattern mining 
algorithm can proceed. In our work, as we aim to discover 
patterns from image patches, a transaction is created for 
each image patch. 

The most critical issue now is how to transform an image patch into
a transaction while retaining as much information as possible.
Fortunately the analysis above (Sec. 4.1) illustrates that CNN
activations are particularly well suited to the task. Specifically,
we treat each dimension index of a CNN activation as an item (4096
items in total). Given the performance of the binarized features
shown above, each transaction is then represented by the dimension
indices of the K largest elements of the CNN activation of the
corresponding image patch.

This strategy satisfies both requirements for applying pattern
mining techniques (Sec. 3). Specifically, given that little
performance is lost when using a sparse representation of CNN
activations (‘sparse property’ in Sec. 4.1), each transaction
calculated as described contains only a small number of items (K is
small). And because binarization of CNN activations has little
deleterious effect on classification performance (‘binary property’
in Sec. 4.1), most of the discriminative information within a CNN
activation is retained by treating dimension indices as items.

Following the work of [70], at the end of each transaction, we add a
pos (or neg) item if the corresponding image patch comes from the
target category (or the background class). Therefore, each complete
transaction has K + 1 items, consisting of the indices of the K
largest elements in the CNN activation plus one class label. For
example, if we set K = 3, given a CNN activation of an image patch
from the target category whose 3 largest magnitudes are in its 3rd,
100th and 4096th dimensions, the corresponding transaction will be
{3, 100, 4096, pos}.

In practice, we first sample a large number of patches from images
in both the target category and the background class. After
extracting their CNN activations, a transaction database
$\mathcal{D}$ is created, containing a large number of transactions
created using the proposed technique. Note that the class labels,
pos and neg, are represented by 4097 and 4098 respectively in the
transactions.
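
As a concrete illustration of this construction, the Python sketch below turns one activation vector into a transaction. The function name `make_transaction`, the constant names and the tie-breaking rule are assumptions for this example; the 1-based item indices and the label codes 4097 and 4098 follow the text above.

```python
POS_ITEM = 4097  # class-label item for patches from the target category
NEG_ITEM = 4098  # class-label item for patches from the background class


def make_transaction(act, k, is_target):
    """Build a transaction with K + 1 items from one CNN activation.

    Items are the 1-based dimension indices of the k largest entries of
    `act` (ties broken by lower index), plus one class-label item.
    """
    order = sorted(range(len(act)), key=lambda i: (-act[i], i))
    items = {i + 1 for i in order[:k]}
    items.add(POS_ITEM if is_target else NEG_ITEM)
    return frozenset(items)


# The example from the text: K = 3 and the largest magnitudes are in
# the 3rd, 100th and 4096th dimensions of a target-category patch.
act = [0.0] * 4096
act[2], act[99], act[4095] = 5.0, 4.0, 3.0
transaction = make_transaction(act, k=3, is_target=True)
print(sorted(transaction))  # [3, 100, 4096, 4097]
```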

4.3. Mining representative and discriminative patterns

Given the transaction database $\mathcal{D}$ constructed in
Sec. 4.2, we use the Apriori algorithm [3] to discover a set of
patterns $\mathcal{P}$ through association rule mining. More
specifically, each pattern $P \in \mathcal{P}$ must satisfy the
following two criteria:

$$\mathrm{supp}(P) > \mathrm{supp}_{\min}, \tag{3}$$

$$\mathrm{conf}(P \to pos) > \mathrm{conf}_{\min}, \tag{4}$$

where $\mathrm{supp}_{\min}$ and $\mathrm{conf}_{\min}$ are
thresholds for the support value and confidence.

Representativeness and discriminativeness. We now demonstrate how
association rule mining implicitly satisfies the two requirements of
mid-level visual element discovery, i.e., representativeness and
discriminativeness. Specifically, based on Eq. (3) and Eq. (4), we
are able to rewrite Eq. (2) thus

$$\mathrm{supp}(P \cup \{pos\}) = \mathrm{supp}(P) \times \mathrm{conf}(P \to pos) > \mathrm{supp}_{\min} \times \mathrm{conf}_{\min},$$

where $\mathrm{supp}(P \cup \{pos\})$ measures the fraction of
pattern $P$ found in transactions of the target category among all
the transactions. Therefore, having values of $\mathrm{supp}(P)$ and
$\mathrm{conf}(P \to pos)$ larger than their thresholds ensures that
pattern $P$ is found frequently in the target category, akin to the
representativeness requirement. A high value of
$\mathrm{conf}_{\min}$ (Eq. (4)) also ensures that pattern $P$ is
more likely to be found in the target category rather than the
background class, reflecting the discriminativeness requirement.
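
The two criteria and the product identity above can be checked on a toy database. The sketch below is illustrative only: the names `supp`, `conf_pos` and `satisfies_criteria`, the toy transactions and the threshold values are all assumptions, and in the full pipeline candidate patterns would come from the Apriori search rather than being tested one at a time.

```python
POS = 4097  # class-label item of the target category (Sec. 4.2)
NEG = 4098  # class-label item of the background class


def supp(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def conf_pos(pattern, transactions):
    """Confidence of the rule pattern -> pos."""
    return supp(set(pattern) | {POS}, transactions) / supp(pattern, transactions)


def satisfies_criteria(pattern, transactions, supp_min, conf_min):
    """Check Eq. (3) and Eq. (4) for a single pattern."""
    if supp(pattern, transactions) <= supp_min:
        return False
    return conf_pos(pattern, transactions) > conf_min


# Toy database: pattern {1, 2} occurs in two target patches and one
# background patch.
D = [{1, 2, POS}, {1, 2, POS}, {1, 2, NEG}, {3, NEG}]

ok = satisfies_criteria({1, 2}, D, supp_min=0.5, conf_min=0.6)
too_strict = satisfies_criteria({1, 2}, D, supp_min=0.5, conf_min=0.7)
```

For this toy database supp({1, 2}) = 0.75 and conf({1, 2} → pos) = 2/3, so the pattern passes with a confidence threshold of 0.6 and fails with 0.7; their product equals supp({1, 2} ∪ {pos}) = 0.5, as the identity states.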

4.4. Retrieving mid-level visual elements 

Given the set of patterns $\mathcal{P}$ discovered in Sec. 4.3,
finding mid-level visual elements is straightforward. A mid-level
visual element $V$ contains the image patches sharing the same
pattern $P$, which can be retrieved efficiently through an inverted
index. This process outputs a set of mid-level visual elements
$\mathcal{V}$ (i.e., $V \in \mathcal{V}$).
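
A minimal sketch of such an inverted index is given below, assuming each transaction is identified by the position of its patch in the database; the function names and the toy transactions are illustrative assumptions, not the released implementation.

```python
from collections import defaultdict


def build_inverted_index(transactions):
    """Map each item to the set of ids of the patches whose transaction contains it."""
    index = defaultdict(set)
    for patch_id, transaction in enumerate(transactions):
        for item in transaction:
            index[item].add(patch_id)
    return index


def retrieve_element(pattern, index):
    """Ids of all patches whose transaction contains every item of `pattern`."""
    postings = [index.get(item, set()) for item in pattern]
    return set.intersection(*postings) if postings else set()


# Toy transactions (K = 3 plus a class-label item, as in Sec. 4.2).
D = [
    {3, 100, 4096, 4097},
    {3, 100, 7, 4097},
    {3, 9, 4096, 4098},
    {3, 100, 4096, 4097},
]
index = build_inverted_index(D)
element = retrieve_element({3, 100}, index)
```

Here `element` is `{0, 1, 3}`. Intersecting the posting lists touches only patches that contain at least one item of the pattern, rather than scanning the whole database. Adding the pos item to the query pattern would restrict the result to target-category patches, if that is desired.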

We provide a visualization of some of the discovered mid-level
visual elements in Fig. 3. It is clear that image patches in each
visual element are visually similar and depict the same semantic
concept, while being discriminative from other categories. For
instance, some mid-level visual elements catch discriminative parts
of objects (e.g., cat faces found in the cat category), and some
depict typical objects or people in a category (e.g., horse-rider
found in the horse category). An interesting observation is that
mid-level elements discovered by the proposed MDPM algorithm are
invariant to horizontal flipping. This is due to the fact that
original images and their horizontally flipped counterparts are fed
into the CNN during the pre-training process.

5. Image representation 

To discover patterns from a dataset containing Y categories, each
category is treated as the target category while all remaining Y − 1
categories in the dataset are treated as the background class. Thus
Y sets of patterns will be discovered by the MDPM algorithm, one for
each of the Y categories. Given the Y sets of patterns and retrieved
mid-level visual elements, we propose two methods to generate image
feature representations. The first method is to use a subset of
patterns (Sec. 5.1), whereas the second one relies on the retrieved
mid-level visual elements (Sec. 5.2). The details of both methods
are as follows.

5.1. Encoding an image using patterns 

5.1.1 Pattern selection 

Due to the problem of pattern explosion (Sec. 3.3), we first select
a subset of the discovered patterns based on a simple criterion. We
define the coverage of a pattern and its retrieved mid-level visual
element as the number of unique images that the image patches in
this element come from (see Fig. 4 for an intuitive example). Then,
we rank the patterns using the proposed coverage criterion. The
intuition here is that we aim to find the patterns whose
corresponding mid-level elements cover as many different images as
possible, resembling the “Purity-Coverage Plot” in [24]. Thus, from
each category, we select the X patterns whose corresponding
mid-level elements have the top-X coverage values. Then, the
selected patterns from all Y categories are combined into a new set
of patterns $\hat{\mathcal{P}}$ which contains X × Y elements in
total.
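
The coverage criterion can be sketched as follows for a single category. The data layout (a dictionary from pattern to patch ids, and a dictionary from patch id to image index) and the function names are assumptions for illustration; the toy numbers mirror the example of Fig. 4, where two patterns cover 4 and 5 unique images.

```python
def coverage(patch_ids, patch_to_image):
    """Number of unique images that the patches of one element come from."""
    return len({patch_to_image[p] for p in patch_ids})


def select_top_patterns(elements, patch_to_image, x):
    """Return the x patterns whose elements have the largest coverage.

    `elements` maps each pattern to the set of patch ids in its
    mid-level visual element. Ties keep the original ordering.
    """
    ranked = sorted(elements,
                    key=lambda p: coverage(elements[p], patch_to_image),
                    reverse=True)
    return ranked[:x]


# Five patches per pattern; patches 0 and 1 come from the same image.
patch_to_image = {0: 1, 1: 1, 2: 2, 3: 3, 4: 4,
                  5: 5, 6: 6, 7: 7, 8: 8, 9: 9}
elements = {
    frozenset({3, 100}): {0, 1, 2, 3, 4},  # coverage 4
    frozenset({7, 9}): {5, 6, 7, 8, 9},    # coverage 5
}
selected = select_top_patterns(elements, patch_to_image, x=1)
```

With x = 1 the second pattern is selected, since its five patches come from five different images. Repeating the selection per category and concatenating the results would give the combined set of X × Y patterns.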

5.1.2 Bag-of-Patterns representation 

To encode a new image using a set of patterns $\hat{\mathcal{P}}$,
we first sample image patches at multiple scales and locations, and
extract their CNN activations. For each 4096-dimensional CNN
activation vector of an image patch, after finding $C_i$, the set of
indices of dimensions that have non-zero values, we check for each
selected pattern $P_k \in \hat{\mathcal{P}}$ whether
$P_k \subseteq C_i$. Thus, our Bag-of-Patterns representation (BoP
for short) $f_{BoP} \in \mathbb{R}^{X \times Y}$ is a histogram
encoding of the set of local CNN activations, satisfying
$[f_{BoP}]_k = |\{i \mid P_k \subseteq C_i\}|$.
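
The histogram defined above can be computed directly from its definition. The sketch below uses short toy vectors instead of 4096-dimensional activations; the function name `bag_of_patterns` and the 1-based indexing of dimensions (chosen to match Sec. 4.2) are assumptions, and in practice the activations would presumably be sparsified before the non-zero test.

```python
def bag_of_patterns(patch_activations, patterns):
    """Count, for each pattern P_k, the patches i with P_k a subset of C_i.

    C_i is the set of 1-based indices of the non-zero dimensions of the
    i-th patch activation.
    """
    hist = [0] * len(patterns)
    for act in patch_activations:
        c_i = {j + 1 for j, v in enumerate(act) if v != 0}
        for k, pattern in enumerate(patterns):
            if set(pattern) <= c_i:
                hist[k] += 1
    return hist


patterns = [{1, 3}, {2}, {4, 5}]
patches = [
    [1.0, 0.0, 2.0, 0.0, 0.0],  # C = {1, 3}: pattern 0 fires
    [0.5, 0.7, 0.1, 0.0, 0.0],  # C = {1, 2, 3}: patterns 0 and 1 fire
    [0.0, 0.0, 0.0, 0.0, 0.0],  # C is empty: nothing fires
]
f_bop = bag_of_patterns(patches, patterns)
print(f_bop)  # [2, 1, 0]
```

The second patch increments two bins at once, which is the behaviour that distinguishes this encoding from a hard-assignment Bag-of-Visual-Words histogram.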
Figure 3. Mid-level visual elements discovered by our algorithm on the Pascal VOC 2007 dataset (for each category, each row is one
exemplar).
Figure 4. An illustration of the pattern selection process (Sec. 5.1.1). For each pattern on the left, the image patches which form the
corresponding mid-level visual elements are shown on the right. The red number underneath each patch is the image index. Since the top
and bottom patterns cover 4 and 5 unique images, their coverage values are 4 and 5 respectively.


Our Bag-of-Patterns representation is similar to the well-known
Bag-of-Visual-Words (BoW) representation [81] if
one thinks of a pattern $P_k \in P$ as one visual word. The
difference is that in the BoW model one local descriptor is
typically assigned to one visual word, whereas in our BoP
representation, multiple patterns can fire on the basis of
a single CNN activation (and thus image patch). Note that
the BoP representation has also been utilized by [34] for image
retrieval. In practice, we also add a 2-level (1×1 and
2×2) spatial pyramid [52] when computing the BoP representation.
More specifically, to generate the final feature
representation, we concatenate the normalized BoP representations
extracted from different spatial cells.
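
As a minimal sketch of the Bag-of-Patterns histogram for a single spatial cell, assuming patterns are stored as sets of dimension indices and activations as rows of a non-negative array (the function name and the 6-dimensional toy data are hypothetical; real activations are 4096-dimensional):

```python
import numpy as np


def bag_of_patterns(activations, patterns):
    """Count, for each pattern, the patches whose non-zero dimensions contain it.

    activations: (n_patches, d) array of non-negative CNN activations.
    patterns:    list of sets of dimension indices.
    """
    histogram = np.zeros(len(patterns), dtype=np.int64)
    for activation in activations:
        nonzero = set(np.flatnonzero(activation).tolist())
        for k, pattern in enumerate(patterns):
            if pattern <= nonzero:  # subset test: the pattern fires on this patch
                histogram[k] += 1
    return histogram


toy_activations = np.array([
    [0.5, 0.0, 1.2, 0.0, 0.0, 3.0],
    [0.0, 2.0, 1.0, 0.0, 0.0, 0.1],
    [1.0, 0.0, 0.3, 0.0, 0.0, 0.0],
])
toy_patterns = [{0, 2}, {2, 5}, {1}]
bop = bag_of_patterns(toy_activations, toy_patterns)
```

The first toy patch fires two patterns at once, which is the difference from hard assignment in the BoW model noted above. The spatial pyramid would repeat this per cell and concatenate the normalized histograms.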


5.2. Encoding an image using mid-level elements 

Due to the redundant nature of the discovered patterns, 
mid-level visual elements retrieved from those patterns are 
also likely to be redundant. 

For the purpose of removing this redundancy, we merge 
mid-level elements that are both visually similar and which 
depict the same visual concept (Sec. 5.2.1). Patch de¬ 
tectors trained from the merged mid-level elements can 
then be used to construct a Bag-of-Elements representation 
(Sec. 5.2.2). 


5.2.1 Merging mid-level elements 

We propose to merge mid-level elements while simultane¬ 
ously training corresponding detectors using an iterative ap¬ 
proach. 

Algorithm 1 summarizes the proposed ensemble merging
procedure. At each iteration, we greedily merge overlapping
mid-level elements and train the corresponding
detector through the MergingTrain function in Algorithm 1.
In the MergingTrain function, we begin by
selecting the element covering the maximum number of
training images, and then train a Linear Discriminant Analysis
(LDA) detector [44]. The LDA detector has the advantage
that it can be computed efficiently using a closed-form
solution $\Sigma^{-1}(\bar{x}_p - \bar{x})$, where $\bar{x}_p$ is the mean of CNN
activations of positive samples, and $\bar{x}$ and $\Sigma$ are the mean and
covariance matrix respectively, which are estimated from a
large set of random CNN activations. Inspired by previous
works [50,53,80], we then incrementally revise this detector.
At each step, we run the current detector on the activations
of all the remaining mid-level elements, and retrain it
by augmenting the positive training set with positive detections.
We repeat this iterative procedure until no more elements
can be added into the positive training set. The idea
behind this process is to use the detection score as a similarity
metric, inspired by Exemplar SVM [61,77]. The output
of the ensemble merging step is a merged set of mid-level
elements and their corresponding detectors. The limitation
Algorithm 1: Ensemble Merging Pseudocode
Input: A set of partially redundant visual elements V
Output: A set of clean mid-level visual elements V'
        and corresponding patch detectors D
Initialize V' ← ∅, D ← ∅;
while V ≠ ∅ do
    [V_t, d] ← MergingTrain(V);
    V ← V \ V_t;
    V' ← V' ∪ { ∪_{V ∈ V_t} V };
    D ← D ∪ {d};
end
return V', D;

Function MergingTrain(V)
    Select V* ∈ V which covers the maximum
    number of training images;
    Initialize V_t ← {V*}, S ← ∅;
    repeat
        V_t ← V_t ∪ S;
        Train LDA detector d using V_t;
        S ← {V ∈ V \ V_t | Score(V, d) > Th} where
        Score(V, d) is the detection score of d on V (Th is a
        pre-defined threshold);
    until S = ∅;
    return V_t, d;



of the proposed merging method is that the merging thresh¬ 
old Th (see Algorithm 1) needs to be tuned, which will be 
analyzed in the experiment (Sec. 6.2.1). 

After merging mid-level elements, we again use the cov¬ 
erage criterion (Sec. 5.1.1) to select X detectors of merged 
mid-level elements for each of the Y categories and stack 
them together. 
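
The merging loop of Algorithm 1 can be sketched as follows. This is a hedged illustration, not the authors' implementation: the data layout (a dict from element id to an array of patch activations), the function names and the toy values are hypothetical, and `Score` is assumed here to be the mean detector response over the patches of an element.

```python
import numpy as np


def train_lda(positives, mean_bg, cov_inv):
    """Closed-form LDA detector: Sigma^{-1} (mean of positives - background mean)."""
    return cov_inv @ (positives.mean(axis=0) - mean_bg)


def merging_train(elements, coverage, mean_bg, cov_inv, th):
    """Grow one group around the element with the largest coverage."""
    seed = max(elements, key=lambda e: coverage[e])
    group = {seed}
    while True:
        positives = np.vstack([elements[e] for e in sorted(group)])
        detector = train_lda(positives, mean_bg, cov_inv)
        # Assumed score: mean detector response over an element's patches.
        added = {e for e in elements
                 if e not in group
                 and float(np.mean(elements[e] @ detector)) > th}
        if not added:
            return group, detector
        group |= added


def ensemble_merging(elements, coverage, mean_bg, cov_inv, th):
    """Repeat merging_train until every element belongs to a merged group."""
    remaining = dict(elements)
    merged, detectors = [], []
    while remaining:
        group, detector = merging_train(remaining, coverage, mean_bg, cov_inv, th)
        merged.append(np.vstack([remaining.pop(e) for e in sorted(group)]))
        detectors.append(detector)
    return merged, detectors


# Toy 2-D activations: elements 0 and 1 are similar, element 2 is not.
toy = {
    0: np.array([[10.0, 0.0], [10.0, 0.0]]),
    1: np.array([[9.0, 1.0], [9.0, 1.0]]),
    2: np.array([[0.0, 10.0], [0.0, 10.0]]),
}
toy_coverage = {0: 5, 1: 3, 2: 4}
merged, detectors = ensemble_merging(toy, toy_coverage, np.zeros(2), np.eye(2), th=50.0)
```

In the toy run, elements 0 and 1 end up in one group and element 2 forms its own; a lower threshold would merge more aggressively, which is why Th needs tuning.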


5.2.2 Bag-of-Elements representation 

As shown in previous works on mid-level visual element
discovery [7, 24, 50, 80], detectors of mid-level elements
can be utilized to generate a Bag-of-Elements representation.
An illustration of this process is shown in Fig. 5. Concretely,
given an image, we evaluate each of the detectors at
multiple scales, which results in a stack of response maps of
detection scores. For each scale, we take the max score per
detector per region encoded in a 2-level (1×1 and 2×2)
spatial pyramid. The final feature representation of an image
has X × Y × 5 dimensions, which is the outcome of
max pooling on the responses from all scales in each spatial
cell.
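
The pooling step can be sketched as below, assuming each scale yields an array of response maps of shape (number of detectors, height, width); the function names and toy maps are hypothetical.

```python
import numpy as np


def pool_one_scale(responses):
    """Max score per detector in the 5 cells of a 1x1 + 2x2 spatial pyramid.

    responses: (n_detectors, H, W) detection scores at one scale, H, W >= 2.
    Returns an (n_detectors, 5) array.
    """
    n, h, w = responses.shape
    hh, hw = h // 2, w // 2
    cells = [
        responses,                                         # 1x1 level
        responses[:, :hh, :hw], responses[:, :hh, hw:],    # 2x2 level, top row
        responses[:, hh:, :hw], responses[:, hh:, hw:],    # 2x2 level, bottom row
    ]
    return np.stack([c.reshape(n, -1).max(axis=1) for c in cells], axis=1)


def bag_of_elements(response_maps):
    """Max-pool over scales within each spatial cell, then flatten."""
    per_scale = np.stack([pool_one_scale(r) for r in response_maps])
    return per_scale.max(axis=0).reshape(-1)


# Two detectors evaluated at two scales (toy values).
scale_one = np.zeros((2, 4, 4))
scale_one[0, 0, 0] = 1.0
scale_one[1, 3, 3] = 2.0
scale_two = np.zeros((2, 2, 2))
scale_two[0, 1, 0] = 3.0
feature = bag_of_elements([scale_one, scale_two])
```

With X = 50 detectors for each of Y = 20 categories, the same function would return 50 × 20 × 5 = 5,000 values.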



[Figure 5 diagram: a new image is evaluated at scale one and scale two; the resulting response maps are pooled into the Bag-of-Elements representation.]



Figure 5. Pipeline to construct a Bag-of-Elements representation, 
which has been used in previous works as well [7,24,50,80]. 


6. Experiments 

This section contains an extensive set of experimental
results and summarizes the main findings. First, some general
experimental setups (e.g., datasets, implementation details)
are discussed in Sec. 6.1, followed by detailed analysis
of the proposed approach on object (Sec. 6.2) and indoor
scene (Sec. 6.3) classification tasks respectively. Relying on
the discovered mid-level visual elements, Sec. 6.4 provides
further analysis of the importance of context information
for recognition, which seldom appears in previous works
on mid-level elements.


6.1. Experimental setup 
6.1.1 CNN models 

For extracting CNN activations from image patches, we
consider two state-of-the-art CNN models which are both
pre-trained on the ImageNet dataset [21]. The first CNN
model is the BVLC Reference CaffeNet [49] (CaffeRef for
short), whose architecture is similar to that of AlexNet [51],
that is, five convolution layers followed by two 4096-dimensional
and one 1000-dimensional fully-connected
layers. The second CNN model is the 19-layer VGG-VD
model [79] which has shown good performance in
the ILSVRC-2014 competition [75]. For both models, we
extract the non-negative 4096-dimensional activation from
the first fully-connected layer after the rectified linear unit
(ReLU) transformation as image patch representations.
6.1.2 Datasets

We evaluate our approach on three publicly available image
classification datasets, two for generic object classification
and the other for scene classification. The details of the
datasets are as follows.

Pascal VOC 2007 dataset. The Pascal VOC 2007 
dataset [28, 29] contains a total of 9,963 images from 20 
object classes, including 5,011 images for training and val¬ 
idation, and 4,952 for testing. For evaluating different al¬ 
gorithms, mean average precision (mAP) is adopted as the 
standard quantitative measurement. 

Pascal VOC 2012 dataset. The Pascal VOC 2012 
dataset [28, 29] is an extension of the VOC 2007 dataset, 
which contains a total of 22,531 images from 20 object 
classes, including 11,540 images for training and valida¬ 
tion, and 10,991 for testing. We use the online evaluation 
server of this dataset to evaluate the proposed approach. 

MIT Indoor dataset. The MIT Indoor dataset [71] contains
67 classes of indoor scenes. A characteristic of indoor
scenes is that unique configurations or objects are often
found in a particular scene, e.g., computers are more
likely to be found in a computer room than in a laundry.
For this reason, many mid-level element discovery algorithms
[9,24,50,80,83] are evaluated on this dataset and
have achieved state-of-the-art performance. We follow the
standard partition of [71], i.e., approximately 80 training
and 20 test images per class. The evaluation metric for the MIT
Indoor dataset is the mean classification accuracy.

6.1.3 Implementation details 

Given an image, we resize its smaller dimension to 256
while maintaining its aspect ratio, then we sample 128×128
patches with a stride of 32 pixels, and calculate the CNN
activations from Caffe (using either the CaffeRef or VGG-VD
models). When mining mid-level visual elements, only
training images are used to create transactions (the trainval
set for the Pascal VOC datasets). The length of each transaction
is set as 20, which corresponds to the 20 largest dimension
indices of the CNN activations of an image patch. We use the
implementation of association rule mining from [8]
(http://www.borgelt.net/apriori.html). The
merging threshold Th in Algorithm 1 (Sec. 5.2.1) is set as
150. For generating image features for classification, CNN
activations are extracted from five scales for the Pascal VOC
datasets as compared to three scales for the MIT Indoor
dataset (we experimentally found that using more than three
scales for MIT Indoor does not improve the overall classification
performance). For training image classifiers, we
use the Liblinear toolbox [30] with 5-fold cross validation.
For association rule mining, the value of supp_min (Eq. 3) is



[Figure 6 plots, panels (a) and (b); horizontal axis: number of mid-level elements per category.]

Figure 6. Performance of proposed feature encoding methods on 
the Pascal VOC 2007 dataset. Note that VGG-VD model is used 
for evaluation. 

always set as 0.01% whereas the value of confmin (Eq. 4) 
is tuned for different datasets. 
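
To make the patch sampling and transaction settings concrete, here is a minimal sketch. The function names are hypothetical, the CNN forward pass is left out, and the transaction is reduced to the indices of the largest activations as described above.

```python
import numpy as np


def patch_origins(height, width, patch=128, stride=32):
    """Top-left corners of all patch x patch windows sampled with the given stride."""
    ys = range(0, height - patch + 1, stride)
    xs = range(0, width - patch + 1, stride)
    return [(y, x) for y in ys for x in xs]


def make_transaction(activation, length=20):
    """Indices of the `length` largest activation values, in ascending index order."""
    top = np.argsort(activation)[::-1][:length]
    return sorted(int(i) for i in top)


# A 256 x 384 image (smaller side already resized to 256).
origins = patch_origins(256, 384)

# Toy 5-dimensional activation; real ones are 4096-dimensional.
toy_transaction = make_transaction(np.array([0.1, 0.9, 0.0, 0.5, 0.7]), length=3)
```

For a 256 × 384 image this grid yields 5 × 9 = 45 patches, each of which produces one transaction.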

6.2. Object classification 

In this section, we provide a detailed analysis of the proposed
system for object classification on the Pascal VOC
2007 and 2012 datasets. We begin with an ablation study
which illustrates the importance of the different components
of our system (Sec. 6.2.1). In Sec. 6.2.2, we compare our
system with state-of-the-art algorithms which also rely on
CNNs, followed by computational complexity analysis in
Sec. 6.2.4. Some visualizations of mid-level visual elements
are provided in Sec. 6.2.3. On the VOC 2007 dataset,
confmin (Eq. 4) is set as 60% for the CaffeRef model and 80% for
the VGG-VD model. On the VOC 2012 dataset, we use
40% for confmin when the VGG-VD model is adopted.

6.2.1 Ablation study 

Bag-of-Elements vs. Bag-of-Patterns. We analyze the
performance achieved by the different encoding methods proposed
in Sec. 5. We denote the Bag-of-Patterns representation
as BoP, and the Bag-of-Elements representation
constructed after the merging process as BoE-M. We also
implement another encoding method, BoE-S, which does
not merge mid-level elements but rather selects mid-level elements
from a large pool of candidates using the coverage
criterion. The performance of the above encoding methods
is illustrated in Fig. 6.

As is illustrated in Fig. 6, when using the same number
of mid-level elements and the same CNN model, the Bag-of-Elements
representation significantly outperforms the
Bag-of-Patterns representation. This could be interpreted as
resulting from the “hard-assignment” process at the heart of
the Bag-of-Patterns method. In contrast, Bag-of-Elements
transaction length    10      20      30
mAP (%)               85.4    87.3    87.6

Table 2. Analysis of the transaction length on the VOC 2007
dataset using the VGG-VD model. Other parameters are frozen.


Th                    50      100     150     200
mAP (%)               86.4    87.2    87.3    87.0

Table 3. Analysis of the merging threshold Th in Algorithm 1 on
the VOC 2007 dataset using the VGG-VD model. Other parameters
are frozen.

does not suffer from this problem because it relies on
the detection responses of the patch detectors. Compared
with direct selection of mid-level elements, performance
is consistently boosted when mid-level elements are first
merged (BoE-M vs. BoE-S), which shows the importance
of the proposed merging algorithm (cf. Algorithm 1).
Therefore, we use our best encoding method, BoE-M, to
compare with the state-of-the-art below (note that the suffix
is dropped).

Number of mid-level elements. Irrespective of the CNN
architecture or encoding method, adding more mid-level elements
or patterns to construct image features consistently
improves classification accuracy (see Fig. 6). Note also
that the performance gain is large when a small number
of mid-level elements (patterns) are used (e.g., from 10 to
20), and seems to saturate when the number of mid-level
elements reaches 50. This is particularly interesting given
the differences between the datasets and the CNN models
used.

Transaction length. We evaluate the performance of our 
approach under three settings of the transaction length, 
which are 10, 20 and 30 respectively. Table 2 depicts the 
results. It is clear from Table 2 that more information will 
be lost when using a smaller transaction length. However, 
as the search space of the association rule mining algorithm 
grows exponentially with the transaction length, this value 
cannot be set very large or otherwise it becomes both time 
and memory consuming. Therefore, we opt for 20 as the 
default setting for transaction length as a tradeoff between 
performance and time efficiency. 
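
As a rough illustration of the exponential growth mentioned above, the number of non-empty itemsets contained in a single transaction of length K is 2^K − 1. This is an upper bound on the candidates a miner could consider per transaction, not a measurement of the actual mining cost; the function name is hypothetical.

```python
def max_itemsets(transaction_length):
    """Number of non-empty itemsets contained in one transaction."""
    return 2 ** transaction_length - 1


# The three transaction lengths evaluated in Table 2.
counts = {k: max_itemsets(k) for k in (10, 20, 30)}
```

Going from length 20 to length 30 multiplies this bound by roughly a thousand, while Table 2 shows a gain of only 0.3% mAP.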

The merging threshold. The merging threshold Th in
Algorithm 1 controls how many mid-level elements should
be merged together. While keeping other parameters fixed,
we evaluate this parameter under different settings. As
shown in Table 3, the best performance is reached when
using a value of 150 for Th.


Pattern selection method in [74]. To show the effectiveness
of the proposed pattern selection (Sec. 5.1.1) and
merging (Sec. 5.2.1) methods, we re-implemented the
pattern selection method proposed by [74] and combined
it with our framework. In [74], patterns are first ranked
according to an interestingness score and then non-overlapping
patterns are selected in a greedy fashion (please refer to
Algorithm 1 in [74]). In our case, after selecting patterns
following [74], we train detectors for the mid-level
elements retrieved from those patterns and construct a
Bag-of-Elements representation (Sec. 5.2.2). On the VOC
2007 dataset, when using the VGG-VD model and 50
elements per category, this framework gives 85.0% mAP,
which is lower than that of our pattern selection method
(86.2%) and pattern merging method (87.3%).


6.2.2 Comparison with state-of-the-arts 

To compare with the state-of-the-art we use the BoE representation
with 50 mid-level elements per category, which
demonstrated the best performance in the ablation study
(Fig. 6). We also consider one baseline method (denoted
as ‘FC’) in which a 4096-dimensional fully-connected activation
extracted from a global image is used as the feature
representation. Table 4 summarizes the performance of our
approach as well as state-of-the-art approaches on Pascal
VOC 2007.

For encoding high-dimensional local descriptors, [58]
propose a new variant of Fisher vector encoding [68].
When the same CaffeRef model is used in both methods,
our performance is on par with that of [58] (76.4%
vs. 76.9%) whereas the feature dimension is 40 times lower
(5k vs. 200k). [64] adds two more layers on top of the
fully-connected layers of AlexNet and fine-tunes the
pre-trained network on PASCAL VOC. Although the
method performs well (77.7%), it relies on bounding box
annotations, which makes the task easier. The FV-CNN
method of [18] extracts dense CNN activations from the
last convolutional layer and encodes them using the classic
Fisher vector encoding. Using the same VGG-VD model,
our BoE representation performs better than this method by
a noticeable margin (87.3% vs. 84.9%), despite the fact that
we only use half of the image scales of FV-CNN (5 vs. 10)
and our feature dimension is significantly lower (5k vs. 65k).

As for the VOC 2012 dataset, as shown in Table 5, when 
using the VGG-VD CNN model and 50 elements per cat¬ 
egory, the proposed BoE representation reaches a mAP of 
85.5%, outperforming most state-of-the-art methods. 

6.2.3 Visualizing mid-level visual elements 

We visualize some mid-level elements discovered by the 
proposed MDPM algorithm and their firings on test images 
… 45.9 81.7 … 55.2 … 80.1 70.9 95.2 …
[73]                88.5 81.0 83.5 82.0 42.0 72.5 85.3 81.6 59.9 58.5 66.5 77.8 81.8 78.8 90.2 54.8 71.1 62.6 87.4 71.8 73.9
[58]                89.5 84.1 83.7 83.7 43.9 76.7 87.8 82.5 60.6 69.6 72.0 77.1 88.7 82.1 94.4 56.8 71.4 67.7 90.9 75.0 76.9
[14]                95.3 90.4 92.5 89.6 54.4 81.9 91.5 … 64.1 76.3 74.9 89.7 92.2 … 95.2 60.7 82.9 68.0 95.5 74.4 82.4
… 88.5 … 85.5 … 95.6 … 58.0 90.4 77.9 …
[45]                91.9 88.6 91.2 89.5 63.0 81.8 88.7 90.1 62.7 79.6 72.8 88.7 90.0 85.8 93.5 63.8 88.4 68.1 92.1 78.7 82.4
[94]                95.1 90.1 92.8 89.9 51.5 80.0 91.7 91.6 57.7 77.8 70.9 … 85.2 … 94.4 …
[17]                … 88.9 … 91.1 90.7 … 79.8 … 90.1 90.8 88.6 94.7 67.7 83.5 78.6 92.9 82.2 84.9
[79]                -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    89.3
BoE (CaffeRef, 50)  90.3 85.4 82.9 79.8 45.9 75.5 89.6 85.1 61.6 60.0 71.2 79.9 88.9 83.4 94.2 53.3 65.9 67.2 91.4 76.0 76.4
BoE (VGG-VD, 50)    97.2 93.3 95.0 91.3 63.3 88.2 93.0 94.1 70.5 79.9 85.6 93.2 94.4 90.4 95.4 70.1 87.7 78.3 97.2 87.0 87.3


Table 4. Comparison of classification results on the Pascal VOC 2007 dataset. For the sake of fair comparison, CNN models of all above 
methods are trained using the dataset used in the ILSVRC competition [75], i.e., 1000 classes from the ImageNet [21]. 
Table 5. Comparison of classification results on the Pascal VOC 2012 dataset. For the sake of fair comparison, CNN models of all above
methods are trained using the dataset used in the ILSVRC competition [75], i.e., 1000 classes from the ImageNet [21].


of the VOC 2007 dataset in Fig. 7. 

Clearly, some mid-level visual elements capture discriminative
parts of an object (e.g., horse faces for the horse
class, the front of locomotives for the train class and wheels
for the motorbike class). It is worth noting here that these
discriminative parts have been shown to be extremely important
for state-of-the-art object recognition systems, such as
Deformable Part Models [31] and Poselets [12]. Moreover,
rather than firing on the underlying object, some mid-level
elements focus on valuable contextual information. For instance,
as shown in Fig. 7, ‘people’ is an important cue both
for the horse and motorbike classes, and ‘coastline’ is crucial
for classifying boat. This fact indicates that mid-level
elements may be a good tool for analysing the importance
of context for image classification (Sec. 6.4).

6.2.4 Computational complexity 

The effectiveness of any mid-level visual element discovery 
process depends on being able to process very large num¬ 
bers of image patches. The recent work of [67], for exam¬ 
ple, takes 5 days to find mid-level elements on the MIT In¬ 
door dataset. The proposed MDPM algorithm has been de¬ 
signed from the beginning with speed in mind, as it is based 
on a very efficient pattern mining algorithm. Thus, for ap¬ 
proximately 0.2 million transactions created from CNN ac¬ 
tivations of image patches on the Pascal VOC 2007 dataset, 
association rule mining takes only 23 seconds to discover 
representative and discriminative patterns. The bottleneck 


of our approach thus lies in the process of extracting CNN 
activations from image patches, which is slower than the 
calculation of hand-crafted HOG features. All CNN-based 
approaches will suffer this time penalty, of course. How¬ 
ever, the process can be sped up using the technique pro¬ 
posed in [96] which avoids duplicated convolution opera¬ 
tions between overlapping image patches. GPUs can also 
be used to accelerate CNN feature extraction. 

6.3. Scene classification 

We now provide detailed analysis of the proposed system
for the task of scene classification on the MIT Indoor
dataset. As many mid-level element discovery algorithms
have reported performance on this dataset, we first provide
a comprehensive comparison between these algorithms and
our method in Sec. 6.3.1. The comparison between the
performance of state-of-the-art CNN-based methods
and ours is presented in Sec. 6.3.2. Finally, we visualize
some mid-level elements discovered by the proposed
MDPM algorithm and their firings in Sec. 6.3.3. For this
dataset, the value of confmin (Eq. 4) is always set as 30%.

6.3.1 Comparison with methods using mid-level ele¬ 
ments 

As hand-crafted features, especially HOG, are widely uti¬ 
lized as image patch representations in previous works, we 
here analyze the performance of previous approaches if 
CNN activations are used in place of their original feature 
Figure 7. Discovered mid-level visual elements and their corresponding detections on test images on the Pascal VOC 2007 dataset.


Method                      # of elements   Acc (%)
[80]                        210             38.10
[50]                        50              46.10
[54]                        20              46.40
[92]                        11              50.15
[83]                        73              51.40
[9]                         50              54.40
[24]                        200             64.03
[67]                        5               73.30
LDA-Retrained (CaffeRef)    20              58.78
LDA-Retrained (CaffeRef)    50              62.30
LDA-KNN (CaffeRef)          20              59.14
LDA-KNN (CaffeRef)          20              63.93
BoE (CaffeRef)              20              68.24
BoE (CaffeRef)              50              69.69
BoE (VGG-VD)                20              76.93
BoE (VGG-VD)                50              77.63


Table 6. Classification results of mid-level visual element discov¬ 
ery algorithms on MIT Indoor dataset. 


types. We have thus designed two baseline methods so as to 
use CNN activations as an image patch representation. 

The first baseline “LDA-Retrained” initially trains an Exemplar
LDA detector using the CNN activation of a sampled patch
and then re-trains the detector 10 times by adding the top-10
positive detections as positive training samples at each iteration.
This is similar to the “Expansion” step of [50]. The
second baseline “LDA-KNN” retrieves the 5 nearest neighbors
of an image patch and trains an LDA detector using the
CNN activations of the retrieved patches (including itself) as
positive training data. For both baselines, discriminative
detectors are selected based on the Entropy-Rank Curves
proposed by [50].
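
The “LDA-KNN” baseline can be sketched as follows. This is an illustration under assumptions that the text does not fix: neighbours are retrieved by Euclidean distance between activations, and the function name, arguments and toy data are hypothetical.

```python
import numpy as np


def lda_knn_detector(query_idx, activations, mean_bg, cov_inv, k=5):
    """Train an LDA detector from a patch and its k nearest neighbours.

    Euclidean distance is an assumption; the distance metric is not specified.
    """
    dists = np.linalg.norm(activations - activations[query_idx], axis=1)
    order = np.argsort(dists, kind="stable")
    neighbours = [int(i) for i in order if i != query_idx][:k]
    # Positives are the query patch itself plus its retrieved neighbours.
    positives = activations[[query_idx] + neighbours]
    detector = cov_inv @ (positives.mean(axis=0) - mean_bg)
    return detector, neighbours


toy_acts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
detector, neighbours = lda_knn_detector(0, toy_acts, np.zeros(2), np.eye(2), k=2)
```

The “LDA-Retrained” baseline differs only in how positives are gathered: it repeatedly scores all patches with the current detector and adds the top detections.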

As shown in Table 6, when using the CaffeRef model,
MDPM achieves significantly better results than both baselines
in the same setting. This attests to the fact that the
pattern mining approach at the core of MDPM is an important
factor in its performance.

We also compare the proposed method against recent 
work in mid-level visual element discovery in Table 6. 
Clearly, by combining the power of deep features and pat¬ 
tern mining techniques, the proposed method outperforms 
all previous mid-level element discovery methods by a size¬ 
able margin. 

6.3.2 Comparison with methods using CNN 

In Table 7, we compare the proposed method to others
in which CNN activations are used, on the task of scene
classification. The baseline method, using fully-connected
CNN activations extracted from the whole image using CaffeRef
(resp. VGG-VD), gives an accuracy of 57.74% (resp.
68.87%). The proposed method achieves 69.69% using
CaffeRef and 77.63% using VGG-VD, which are significant
improvements over the corresponding baselines.

Our method is closely related to [42] and [57] in the
sense that all rely on off-the-shelf CNN activations of image
patches. Our BoE representation, which is based on
mid-level elements discovered by the MDPM algorithm, not
only outperforms [42] on 128×128 and 64×64 patches by
a considerable margin (69.69% vs. 65.52% and 69.69% vs.
62.24%), it also slightly outperforms that of [57] (69.69%
vs. 68.20%). Our performance (77.63%) is also comparable
Figure 8. Discovered mid-level visual elements and their corresponding detections on test images on the MIT Indoor dataset.


to that of the recent works of bilinear CNN [56] (77.55%) 
and its compact version [37] (76.17%) when the VGG-VD 
model is adopted. 

Fine-tuning has been shown to be beneficial when transferring
pre-trained CNN models to another dataset [2,40,
64]. We are interested in how the performance changes if
a fine-tuned CNN model is adopted in our framework. For
this purpose, we first fine-tuned the VGG-VD model on the
MIT Indoor dataset with a learning rate of 0.0005. The
fine-tuned model reaches 69.85% accuracy after 70k iterations.
After applying the fine-tuned model in our framework,
the proposed approach reaches 71.82% accuracy,
which is lower than the case of using a pre-trained model
(77.63%) but still improves on the accuracy of direct fine-tuning
(69.85%). The underlying reason is probably
the small training data size of the MIT Indoor dataset and
the large capacity of the VGG-VD model. We plan to investigate
this issue in our future work. A similar observation was
made in [37].


6.3.3 Visualizing mid-level visual elements 

We visualize some discovered visual elements and their firings
on test images of the MIT Indoor dataset in Fig. 8. It is
intuitive that the discovered mid-level visual elements capture
the visual patterns which are often repeated within a
scene category. Some of the mid-level visual elements refer
to frequently occurring object configurations, e.g., the
configuration between table and chair in the meeting room
category. Some instead capture a particular type of object
in the scene, such as washing machines in the laundromat.
VOC 2007    gt-object   object-context   scene-context
aero        95.0        0.0              5.0
bike        100.0       0.0              0.0
bird        90.0        0.0              10.0
boat        65.0        0.0              35.0
bottle      43.7        56.3             0.0
bus         100.0       0.0              0.0
car         100.0       0.0              0.0
cat         100.0       0.0              0.0
chair       55.0        45.0             0.0
cow         87.5        0.0              12.5
table       100.0       0.0              0.0
dog         100.0       0.0              0.0
horse       100.0       0.0              0.0
mbike       100.0       0.0              0.0
person      40.0        60.0             0.0
plant       88.2        5.9              5.9
sheep       90.0        0.0              10.0
sofa        100.0       0.0              0.0
train       100.0       0.0              0.0
tv          95.0        0.0              5.0
Average     87.5        8.3              4.2


Table 8. Firing types of the top-20 mid-level elements on the Pascal VOC 2007 dataset (VGG-VD model adopted). 


Method              Acc (%)   Comments
FC (CaffeRef)       57.74     CNN for whole image
FC (VGG-VD)         68.87     CNN for whole image
[73]                58.40     OverFeat toolbox
[42] (CaffeRef)     68.88     concatenation
[6]                 65.90     jittered CNN
[6]                 66.30     FTCNN
[100]               68.24     Places dataset used
[58] (CaffeRef)     68.20     new Fisher encoding
[57] (CaffeRef)     68.80     cross-layer pooling
[67]                73.30     unified pipeline
[56] (VGG-VD)       77.55     Bilinear CNN
[37] (VGG-VD)       76.17     compact Bilinear CNN
[63]                77.40     shared parts
[17] (VGG-VD)       81.00     Fisher Vector
BoE (CaffeRef)      69.69     50 elements
BoE (VGG-VD)        77.63     50 elements


Table 7. Classification results of methods using CNN activations 
on MIT Indoor dataset. 

6.4. Do mid-level visual elements capture context? 

It is well known that humans do not perceive every instance
in the scene in isolation. Instead, context information
plays an important role [16, 23, 46, 59, 60, 85]. In our
scenario, we consider how likely it is that the discovered
mid-level visual elements fire on context rather than the
underlying object. In this section, we answer this question
based on the Pascal VOC 2007 dataset, which has ground-truth
bounding box annotations.

6.4.1 Object and scene context 

We first need to define context qualitatively. For this purpose,
we leverage the test set of the segmentation challenge
of the Pascal VOC 2007 dataset in which per-pixel labeling
is available. Given a test image of a given object category,
its ground-truth pixel annotations S are categorized into
the following three categories,

• $S_{gt}$: pixels belonging to the underlying object category.

• $S_{ot}$: pixels belonging to any of the remaining 19 object
categories.

• $S_{sc}$: pixels belonging to none of the 20 object categories,
i.e., belonging to the background.



Accordingly, given a firing (i.e., predicted bounding box) B
of a mid-level visual element on an image, we compute an
overlap ratio for each of the three types of pixels,

$$O_{gt} = \frac{|B \cap S_{gt}|}{|B|}, \quad O_{ot} = \frac{|B \cap S_{ot}|}{|B|}, \quad O_{sc} = \frac{|B \cap S_{sc}|}{|B|}, \qquad (6)$$

where $|\cdot|$ measures cardinality. Note that $O_{gt} + O_{ot} + O_{sc} = 1$.
By comparing the three types of overlap ratios, we can
easily define three firing types, which include two types of
context firing and one ground-truth object firing,

• Scene context: if $O_{sc} > 0.9$.

• Object context: if $O_{sc} \le 0.9$ and $O_{ot} > O_{gt}$.

• Ground-truth object: if $O_{sc} \le 0.9$ and $O_{ot} < O_{gt}$.

Figure 9. An illustration of the three firing types of mid-level elements.
In the image, ground-truth object instances of the underlying
category (e.g., “person”) are overlaid in green while instances
of other categories (e.g., “cow”) are overlaid in red. Obviously, the
firing (1) fires on the ground-truth object while firings (2) and (3)
belong to object and scene context respectively.

Fig. 9 depicts a visual example of the three firing types.
In practice, for each image in the test set, we collect the
predicted bounding box with the maximum detection score
if there exist any positive detections (larger than a threshold),
followed by categorizing it into one of the three types
based on Eq. 6. Thus, a mid-level visual element is categorized
into one of the three firing types based on the majority vote of
its positive detections.

6.4.2 Analysis

Following the context definition in Sec. 6.4.1, we categorize
each of the top-20 discovered mid-level visual elements
of each category of the Pascal VOC 2007 dataset into one of
Figure 10. Detections of two object-context mid-level visual elements discovered from the “person” category on the Pascal VOC 2007
dataset.


the three categories: gt-object, object or scene context. The 
distribution of this categorization is illustrated in Table 8. 
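
The categorization rule of Sec. 6.4.1 can be sketched as below. The sketch assumes a per-pixel label map in which 0 denotes background and positive integers denote object classes, ignores any "void" pixels, and sends the unspecified tie case (equal object overlaps) to gt-object; function names, class ids and the toy label map are hypothetical.

```python
from collections import Counter

import numpy as np


def overlap_ratios(box, label_map, target_class, background=0):
    """Fractions of the box covered by target, other-object and background pixels."""
    y0, x0, y1, x1 = box
    region = label_map[y0:y1, x0:x1]
    area = region.size
    o_gt = np.count_nonzero(region == target_class) / area
    o_sc = np.count_nonzero(region == background) / area
    o_ot = np.count_nonzero((region != target_class) & (region != background)) / area
    return o_gt, o_ot, o_sc


def firing_type(box, label_map, target_class, background=0):
    """Assign one firing (bounding box) to one of the three firing types."""
    o_gt, o_ot, o_sc = overlap_ratios(box, label_map, target_class, background)
    if o_sc > 0.9:
        return "scene-context"
    if o_ot > o_gt:
        return "object-context"
    return "gt-object"  # ties also land here, which is an assumption


def element_type(firing_types):
    """Majority vote over the firing types of an element's positive detections."""
    return Counter(firing_types).most_common(1)[0][0]


# Toy 10 x 10 label map: class 15 in the top-left, class 10 in the bottom-right.
labels = np.zeros((10, 10), dtype=int)
labels[0:5, 0:5] = 15
labels[5:10, 5:10] = 10
votes = [firing_type(b, labels, target_class=15)
         for b in [(0, 0, 5, 5), (5, 5, 10, 10), (0, 5, 5, 10)]]
```

Applying `element_type` to the firings of each of the top-20 elements per category, and averaging over elements, gives per-class percentages of the kind reported in Table 8.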

Interestingly, for many classes, the majority of the discovered
mid-level visual elements fire on the underlying
object, and context information seems to be less important.
More specifically, as shown in Table 8, mid-level visual elements
in 10 out of 20 classes never capture context information,
which suggests that image patches capturing context in these
classes are neither representative nor discriminative. On average,
more than 87% of mid-level visual elements capture the
underlying object across all the categories.

We also observe that contextual information from other
object categories plays an important role in discovering
mid-level visual elements for person (60.0%), bottle (56.3%)
and chair (45.0%). Fig. 10 shows two examples of
object-context mid-level visual elements discovered from class
person.

As depicted in Table 8, most categories have a very low
proportion of scene-context mid-level visual elements,
except for boat, which has a relatively high value of 35%.

We also compare the distributions of mid-level elements
discovered using different CNN models (Fig. 11). For both
CNN models, the majority of mid-level elements capture
parts of ground-truth objects, and contextual ones constitute
only a relatively small fraction. Moreover, the fraction of
mid-level visual elements capturing ground-truth objects is
14 percentage points higher for the VGG-VD model than for
the CaffeRef model (88% vs. 74%). We thus conjecture that,
for image classification, deeper CNNs are more likely to
learn to represent the underlying objects, and contextual
information may not be that valuable.

Figure 11. Distributions of mid-level visual elements
discovered using different CNN models: CaffeRef (left) and
VGG-VD (right). Legend: obj-context, gt-obj, scene-context.

7. Discussion

Recently, some works on accelerating CNNs [19,72]
advocate using binary activation values in CNNs. It would
be interesting to try binary CNN features for creating
transactions. In this case, for an image patch, all of its CNN
dimensions with positive activation values will be kept to
generate one transaction. This means we do not need to
select the K largest activation magnitudes as in the current
approach (Sec. 4.2), and there will be no information loss
for transaction creation at all.
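
The difference between the two ways of creating a transaction can be illustrated with a small Python sketch. It is a minimal illustration under stated assumptions rather than the implementation used in our experiments: the activation vector is a plain list of floats, a transaction is modelled as a set of dimension indices, and the function names `topk_transaction` and `binary_transaction` are hypothetical.

```python
def topk_transaction(activation, k):
    """Current approach as described in the text: keep the indices of the
    k largest activation magnitudes.

    For non-negative (e.g. post-ReLU) activations the magnitude equals the
    value itself. Ties are broken by the lower index (assumption).
    """
    order = sorted(range(len(activation)),
                   key=lambda i: abs(activation[i]),
                   reverse=True)
    return frozenset(order[:k])


def binary_transaction(activation):
    """Binary variant discussed above: keep every dimension whose
    activation is positive, so no K has to be chosen."""
    return frozenset(i for i, a in enumerate(activation) if a > 0)


# Toy 6-dimensional activation of one image patch.
act = [0.0, 2.5, 0.1, 0.0, 1.2, 0.7]

print(sorted(topk_transaction(act, k=2)))  # [1, 4]
print(sorted(binary_transaction(act)))     # [1, 2, 4, 5]
```

With real-valued activations, the top-K transaction drops the weaker positive dimensions (2 and 5 in the toy example). With binary activations, the set of positive indices encodes the activation vector exactly, which is the sense in which no information is lost.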

As the feature dimension of the Bag-of-Elements
representation (Sec. 5.2.2) is proportional to the number of
categories, most of the current works on mid-level visual
elements, including ours, cannot be applied to image
classification datasets that contain a huge number of
categories (e.g., ImageNet [21] and Places [100]). A promising
direction for future work to address this scalability issue
may be the use of shared mid-level visual elements [63].

8. Conclusion and future work 

We have addressed the task of mid-level visual element
discovery from the perspective of pattern mining. More
specifically, we have shown that CNN activations can be
encoded into transactions, the data structure used by
existing pattern mining techniques, which can then be readily
applied to discover discriminative mid-level visual element
candidates. We further develop different strategies to
generate image representations from the mined visual element
candidates. We experimentally demonstrate the effectiveness
of the mined mid-level visual elements and achieve
state-of-the-art classification performance on various
datasets by using the generated image representations.

Although this paper only addresses the image
classification problem, our method can be extended to many
other applications and serves as a bridge between the visual
recognition and pattern mining research fields. Since the
publication of our conference paper [55], there have been
several works [22,65] which follow our approach to develop
methods suited for various applications, including human
action and attribute recognition [22] and modeling visual
compatibility [65].

In future work, we plan to investigate three directions
to extend our approach. First, we will develop efficient
mining methods to mine patterns that are shared across
categories. This will address the limitation that the current
method can only detect discriminative patterns for each
category and thus does not scale well to a dataset with a
huge number of categories, e.g., ImageNet. Second, we will
extend our method to the metric learning setting. In such a
setting, the mined discriminative patterns are only used to
make a binary decision, that is, whether the two input
images are from the same category. Finally, we will apply
our method to more applications, especially those that can
leverage state-of-the-art pattern mining techniques.